---
title: 'Have I been pwned? – DIY style'
url: /2022/08/pwned-diy/
draft: false
categories:
#- 🇬🇧
- en
- development
date: 2022-08-30 11:23:14+02:00
tags:
- password
- security
- cdb
- permacomputing
- sh
- sha
- awk
- CGI
- DIY
type: post
author: Marcus Rohrmoser
---

*tl;dr: look up sha1 sums via <https://mro.name/2022/pwned/passwords/> but beware: it
doesn't use the better [pwned api](https://haveibeenpwned.com/API).*

While the venerable [xkcd on password strength](https://xkcd.com/936/) discourages
alphabet soup, one thing matters even more:

**Don't ever use leaked passwords!**

But how would you know? [Troy Hunt](https://en.wikipedia.org/wiki/Troy_Hunt)
maintains a [set of leaked passwords](https://haveibeenpwned.com/) that you can
test your password candidate against online, or download and test against
locally. (Online testing does *not* involve uploading your password.)
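Both the dataset and the lookup described below are keyed on the uppercase hex SHA-1 of the password, so that is what you need to compute first. A minimal sketch, assuming `sha1sum` from GNU coreutils is installed; the function name `sha1_upper` is made up for illustration:

```sh
#!/bin/sh
# Print the uppercase hex SHA-1 of a password candidate.
# Assumption: sha1sum (GNU coreutils) is on the PATH.
sha1_upper () {
  # printf '%s' avoids hashing a trailing newline, which echo would add
  printf '%s' "$1" | sha1sum | cut -c 1-40 | tr 'abcdef' 'ABCDEF'
}

sha1_upper 'password'
```

For the candidate `password` this prints `5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8`.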

This post shows how to handle such a large dataset and get fast lookups with
low-profile machinery – cheap hardware, djb's [cdb](https://cr.yp.to/cdb/cdb.txt)
and some shell/awk scripting.

## Get the dataset

Be nice and download via torrent from <https://haveibeenpwned.com/Passwords>. I
just used `curl` over HTTP, however.

## Slice it up

The whole set is way too big for a single cdb, so we split it into one file per
first hex character of the sha1 password hashes. Expect that to run for some days
and produce 16 files around 3.3G each:

<pre class="line-numbers"><code class="language-sh">#!/bin/sh

# map the first hex char of the sha to a database filename

# curl -LO 'https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v8.7z'
# sudo apt-get install p7zip
# p7zip -d pwned-passwords-sha1-ordered-by-hash-v8.7z
#
# revert:
# $ cdb -d pwned-passwords-v8-sha1-?.cdb | head | cut -d : -f 2 | sed 's/->/:/'
#
readonly raw="pwned-passwords-sha1-ordered-by-hash-v8.txt"

date
echo "segmenting"
cat "${raw}" \
  | tr -d '\015' \
  | tr ':' ' ' \
  | awk '//{f=substr($0,1,1);print >> f;fflush(f)}'

for c in 0 1 2 3 4 5 6 7 8 9 A B C D E F
do
  echo "shard ${c}"
  cdb -c -m "pwned-passwords-v8-sha1-${c}.cdb" \
    < "${c}" \
    && rm "${c}"
done
date
</code></pre>
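To see what the segmenting step does without waiting for days, the same pipeline can be tried on a handful of lines. This is only a sketch: the three hashes and counts are invented sample data, the `segment` function and the temporary output directory are illustrative, and the `cdb -c -m` step is left out.

```sh
#!/bin/sh
# Sketch of the segmenting step; the hashes and counts below are invented sample data.

# Read raw "HASH:COUNT" lines (CRLF line ends) from stdin and append each
# as "HASH COUNT" to a file in directory $1 named after the first hex character.
segment () {
  tr -d '\015' \
    | tr ':' ' ' \
    | awk -v dir="$1" '{ f = dir "/" substr($0, 1, 1); print >> f }'
}

out="$(mktemp -d)"
printf '%s\r\n' \
  '0000000000000000000000000000000000000001:3' \
  '0000000000000000000000000000000000000002:7' \
  'A000000000000000000000000000000000000003:42' \
  | segment "${out}"

ls "${out}"       # one file per first character, here: 0 and A
cat "${out}/A"    # the single record that starts with A
```

Each resulting file holds whitespace-separated key/value lines, which is the shape the script above feeds to `cdb -c -m`.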

## Query

A CGI script is enough to look up the count for a SHA-1 hash:

<pre class="line-numbers"><code class="language-sh">#!/bin/sh

do_retry () {
cat &lt;&lt;EOF
Status: 303 See Other
Location: .
Content-Type: text/plain

Retry
EOF
exit
}

# qs="&${QUERY_STRING}"
qs="&$(cat)" # POST to not log the sha1.

case "$(printf '%s' "${qs}" | cut -c 1-6)" in
"&sha1=")
  sha1="$(echo "${qs}" | cut -c 7-46 | tr 'abcdef' 'ABCDEF')"
  ;;
*) do_retry ;;
esac

tpl="tpl.omg-yes.html"

shard="$(echo "${sha1}" | cut -c 1)"
# https://stackoverflow.com/a/39360056/349514
count="$(cdb -q "/home/mro/Downloads/pwned-passwords-v8-sha1-${shard}.cdb" "${sha1}" 2>/dev/null)"
[ $? = 0 ] || tpl="tpl.fine-no.html"

cat &lt;&lt;EOF
Status: 200 OK
Content-Type: text/html; charset=utf-8

EOF
export count
envsubst < "${tpl}"
</code></pre>
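The script above passes whatever follows `sha1=` on to `cdb`. As an optional hardening step that is not part of the original, the value could be checked for exactly 40 uppercase hex digits first. The helper names `is_sha1`, `shard_file` and `lookup` are hypothetical, and the database directory is an assumption passed as a second argument:

```sh
#!/bin/sh
# Hypothetical helpers, not part of the original CGI script.

# Succeed only for exactly 40 uppercase hex digits.
is_sha1 () {
  [ "${#1}" -eq 40 ] || return 1
  case "$1" in
    *[!0123456789ABCDEF]*) return 1 ;;
  esac
  return 0
}

# Map a hash ($1) to its shard file inside directory $2 (default: current directory).
shard_file () {
  printf '%s/pwned-passwords-v8-sha1-%s.cdb\n' \
    "${2:-.}" "$(printf '%s\n' "$1" | cut -c 1)"
}

# Print the leak count; non-zero exit status if the hash is malformed or not found.
# Assumption: tinycdb's cdb is installed and the shards exist in directory $2.
lookup () {
  is_sha1 "$1" || return 2
  cdb -q "$(shard_file "$1" "$2")" "$1"
}
```

With the shards in place, `lookup 5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8 /home/mro/Downloads` would query the shard ending in `-5.cdb`, mirroring what the CGI does after parsing the POST body.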

## Conclusion

The older the tools, the better they work on cheap computers, and plain text is a
powerful data format.

The tools used are all decades old:

* [awk](https://en.wikipedia.org/wiki/AWK) 1977,
* [cgi](https://www.ietf.org/rfc/rfc3875.html) 1993,
* [cdb](https://cr.yp.to/cdb/cdb.txt) 1996,
* [curl](https://curl.haxx.se) 1998,
* [tinycdb](http://www.corpit.ru/mjt/tinycdb.html) 2001,
* [envsubst](https://manpages.debian.org/envsubst) 2003,

and they work so well *because* they use plain text rather than fancy formats
such as XML or JSON.

